feat: voice-activity streaming mode & inner-vad for speech-to-text module#1160
Conversation
694fe4f to
1c2411e
Compare
02113ff to
6bba141
Compare
1c2411e to
0ea858d
Compare
This comment was marked as resolved.
This comment was marked as resolved.
| } | ||
| })(); | ||
|
|
||
| while (this.isStreaming && !finished) { |
There was a problem hiding this comment.
stream() resolves as soon as this.isStreaming flips, but the native loop only re-checks the flag at the top of the next iteration — so for up to timeout + one inference after await streamStop() returns, the native streamer is still alive, can still queue callInvoker_->invokeAsync callbacks, and still touches audioBuffer_. If the caller then runs unload() (or the host object is destroyed) we're in UAF / use-after-unload territory.
Two options: (a) actually join — stream() doesn't resolve until the native stream() call returns, and streamStop() awaits that; or (b) document explicitly that unload() is not safe immediately after streamStop() and that callbacks may fire after the promise resolves. (a) is the safer contract.
chmjkb
left a comment
There was a problem hiding this comment.
besides my previous comment, I think it looks good, great work!
4ef78de to
ae72c0d
Compare
msluszniak
left a comment
There was a problem hiding this comment.
Review from local verification (VAD native tests pass 15/15, demo app boots). A few correctness items and minor cleanups inline.
The size check on audioBuffer_ raced with streamInsert writes under audioBufferMutex_. Move both the size comparison and the erase under a single lock so the read isn't concurrent with vector mutation.
generate() ran unlocked against a std::span pointing into audioBuffer_, relying on the vector's reservation never being exceeded. Unbounded streamInsert from JS could grow the buffer past capacity, trigger reallocation, and invalidate the span. Take a local copy under the lock instead so the inference operates on stable data.
Previously `lastMerged.end = current.end` would shrink the merged segment if a non-monotonic input arrived (current.end < lastMerged.end). postprocess() doesn't produce such input today, but the safer form removes the hidden invariant.
isStreaming stayed true after the native stream() resolved (whether normally or via error), so subsequent code relying on the flag saw stale state. Reset it in a finally block alongside the wake/finished bookkeeping.
Mutating the in-flight `options` to flip `useVAD` off before the final `finish()` call worked but left a footgun for anything later that reads back `options.useVAD`. Build a local copy with the override instead.
The function only reads from the span. Tagging it const signals intent and matches the equivalent OnlineASR::insertAudioChunk signature on the STT side.
`||` coerced an explicit `0` to the default 500. Switch to `??` so callers can pass 0 to disable the margin.
Explain what the 1.2 multiplier means — widens the VAD merge window relative to the user-configured detectionMargin so brief intra-utterance silences don't split a single utterance into separate segments.
`OnlineASR::process` computes the silence-trim cut as a `size_t` subtraction of these two constants. If either is tweaked such that the ordering inverts, the subtraction wraps and the subsequent `erase` reads past the buffer. Lock the invariant in at compile time.
SpeechToText gained a 4th positional `vadSource` argument; pass an empty string at all 9 existing call sites so the test still exercises the no-VAD path. Add the new VAD sources to the CMake target so the binary links.
mergeSegments: empty input, single-segment passthrough, distant segments stay separate, close/adjacent segments merge, overlapping shorter inner doesn't shrink the result, mixed sequence merges only adjacent close pairs. stream/streamInsert/streamStop: stream() loop exits promptly on streamStop, streamInsert while streaming doesn't crash, concurrent stream() throws StreamingInProgress, and stream can be restarted after a stop.
Covers the new PR behavior: - valid vadSource constructs without throwing - invalid vadSource fails loudly - one-shot transcribe() is unaffected when VAD is loaded - stream(useVAD=true) on a model built without VAD throws - stream(useVAD=true) over pure-silence audio drives the VAD branch of OnlineASR::process and exits cleanly via streamStop() Also register fsmn-vad in run_tests.sh so the SpeechToTextTests runner pushes the VAD model alongside the Whisper artifacts.
Required by the const-span signature of VoiceActivityDetection:: streamInsert. Without this, ModelHostObject's template instantiation of synchronousHostFunction<&VoiceActivityDetection::streamInsert> references an undefined symbol at link time on Android. Mirrors the existing std::span<float> specialization; the underlying getTypedArrayAsSpan<float>() helper returns a span over the same storage, which converts implicitly to span<const float>.
Description
This PR introduces changes focused on voice-activity-detection module and it's utilization within the library:
VoiceActivityDetectionScreenand changes the behavior ofSpeechToTextScreen, adding a toggle to switch the VAD submodule for STT on/off.Introduces a breaking change?
Type of change
Tested on
Testing instructions
VoiceActivityDetectionScreenwithin theSpeechdemo app.SpeechToTextScreenwithin theSpeechdemo app, with VAD toggle on.Screenshots
Related issues
#1118
Checklist
Additional notes